Aggregate GPU task metrics in the profiling tool by parthosa · Pull Request #2088 · NVIDIA/spark-rapids-tools

parthosa · 2026-04-27T16:45:35Z

Contributes #2020

Changes

1. New GPU task-metric aggregation CSVs at three levels

Adds long-format aggregations for the 26 GPU task accumulators emitted by the RAPIDS plugin (GpuTaskMetrics.scala). Today these are only available raw in stage_level_all_metrics.csv.

| Level | Filename | Columns |
|---|---|---|
| Stage | `gpu_stage_level_aggregated_task_metrics.csv` | `stageId, numTasks, metricName, unit, sum, max, avg` |
| SQL | `gpu_sql_level_aggregated_task_metrics.csv` | `sqlId, metricName, unit, sum, max, avg` |
| App | `gpu_app_level_aggregated_task_metrics.csv` | `appId, metricName, unit, sum, max, avg` |

Note:

- The job level is skipped: each Spark action is a job, so job rows would either duplicate the SQL row or describe meaningless setup/collect jobs.
- numTasks is emitted only at the stage level, where it varies; at the SQL and app levels it would repeat the same constant on every row (and is already available in sql_level_aggregated_task_metrics.csv).
- The CSVs are not generated when no GPU metrics are present.

Example (SQL row):

sqlId,metricName,unit,sum,max,avg
24,gpuTime,ms,86643,897,237
24,gpuMaxDeviceMemoryBytes,bytes,,10124115675,

4. Max-aggregated metrics: AccumMetaRef.METRICS_WITH_MAX_AGGREGATES extended from 4 → 9 entries. For these, sum and avg are emitted empty; only max is meaningful.
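As a minimal sketch of how a max-only row could end up with empty `sum` and `avg` fields, assume a hypothetical `GpuSqlRow` case class with `Option[Long]` values. This is an illustration only; the PR's actual result classes and CSV helpers are not reproduced here.

```scala
// Hypothetical row type for illustration; not the PR's actual case class.
final case class GpuSqlRow(
    sqlId: Long,
    metricName: String,
    unit: String,
    sum: Option[Long],
    max: Option[Long],
    avg: Option[Long]) {

  // A missing value is written as an empty CSV field.
  private def cell(v: Option[Long]): String = v.map(_.toString).getOrElse("")

  def toCsv: String =
    Seq(sqlId.toString, metricName, unit, cell(sum), cell(max), cell(avg)).mkString(",")
}

object GpuSqlRowExample {
  // Values taken from the SQL example rows in the PR description.
  val summed: GpuSqlRow =
    GpuSqlRow(24L, "gpuTime", "ms", Some(86643L), Some(897L), Some(237L))
  val maxOnly: GpuSqlRow =
    GpuSqlRow(24L, "gpuMaxDeviceMemoryBytes", "bytes", None, Some(10124115675L), None)
}
```

Calling `toCsv` on the two values yields lines of the same shape as the SQL example rows above, with the max-only metric leaving its first and last numeric fields empty.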

Testing

- AnalysisSuite: three new tests covering rows produced for a GPU log plus rollup math, max-aggregated metrics carrying only max, and empty output for a CPU-only log.
- E2E smoke on core/src/test/resources/spark-events-profiling/gpu_oom_eventlog.zstd: all three CSVs are produced and the cross-level math verifies.

Emits three new long-format CSVs covering the 26 GPU task accumulators from GpuTaskMetrics.scala (gpu_stage_/sql_/app_level_aggregated_task_metrics.csv). Auto-discovery by name (gpu*, perfio.s3.*, multithreadReaderMaxParallelism); units derived from the name (Time/Wait→ms, Bytes→bytes, else count); SQL/app levels re-sum stage rows. Skips emission when no GPU metrics are present. Job level intentionally skipped (each Spark action is a job, so its rows would either duplicate the SQL row or be meaningless).
Fixes NVIDIA#2020
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
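The name-based discovery and unit derivation described in this commit could look roughly like the sketch below. The object name `GpuMetricNaming`, the `unitFor` helper, and the use of `contains` for substring matching are assumptions; only the name patterns and the unit mapping come from the commit message.

```scala
object GpuMetricNaming {
  // Name-based discovery following the patterns listed in the commit message.
  def isGpuMetric(name: String): Boolean =
    name.startsWith("gpu") ||
      name.startsWith("perfio.s3.") ||
      name == "multithreadReaderMaxParallelism"

  // Unit inferred from the metric name. Checking Time/Wait before Bytes,
  // and matching anywhere in the name, are assumptions of this sketch.
  def unitFor(name: String): String =
    if (name.contains("Time") || name.contains("Wait")) "ms"
    else if (name.contains("Bytes")) "bytes"
    else "count"
}
```

Under these assumptions, `gpuTime` maps to `ms` and `gpuMaxDeviceMemoryBytes` maps to `bytes`, which agrees with the units in the PR's example rows.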

Adds appId as the leading column on gpu_app_level_aggregated_task_metrics.csv so downstream consumers can join by application without relying on the output directory path. Also bumps the copyright year on touched files to 2026 (the pre-commit hook's sed is BSD-incompatible on macOS and silently no-ops).
Fixes NVIDIA#2020
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Partho Sarthi <psarthi@nvidia.com>

ArrayBuffer.flatMap returns ArrayBuffer (mutable), which no longer auto-coerces to immutable.Seq under Scala 2.13. Materialize the per-SQL row collection as Seq before passing to rollupGpuRows, and use an explicit lambda for the inner flatMap.
Fixes NVIDIA#2020
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
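The Scala 2.13 behavior behind this fix can be reproduced in isolation. In the sketch below, `rollup` is a hypothetical stand-in for a method that accepts `Seq` (it is not the PR's `rollupGpuRows`), and the integer data is made up.

```scala
import scala.collection.mutable.ArrayBuffer

object SeqCoercionExample {
  // Stand-in for a method taking scala.Seq, which is immutable.Seq on 2.13.
  def rollup(rows: Seq[Int]): Int = rows.sum

  val stageIds: ArrayBuffer[Int] = ArrayBuffer(1, 2, 3)

  // flatMap on an ArrayBuffer yields another (mutable) ArrayBuffer.
  val perStage: ArrayBuffer[Int] =
    stageIds.flatMap((id: Int) => Seq(id, id * 10))

  // Passing perStage directly would not type-check on 2.13,
  // so the buffer is materialized as a Seq first.
  val total: Int = rollup(perStage.toSeq)
}
```

The explicit `.toSeq` also compiles on Scala 2.12, so the same source can be built against both versions.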

greptile-apps · 2026-04-27T20:18:27Z

Greptile Summary

This PR adds long-format GPU task metric aggregations at three granularities (stage, SQL, app) by mining the existing GPU accumulators in app.accumManager. The rollup math — task-weighted averages, sum propagation, and max-only handling for 9 known max-aggregate metrics — is correctly implemented and well-tested.
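The rollup rules named here (sums add up, maxes take the maximum, averages are weighted by task count, and max-only metrics keep empty sum and avg) can be sketched as follows. `StageRow`, `Rolled`, and `RollupSketch.rollup` are hypothetical names, and integer division for the average is an assumption; the PR's `rollupGpuRows` may differ in detail.

```scala
// Hypothetical stage-level row; field names mirror the CSV columns.
final case class StageRow(
    stageId: Int,
    numTasks: Int,
    metricName: String,
    sum: Option[Long],
    max: Option[Long],
    avg: Option[Long])

// Hypothetical rolled-up row for the SQL or app level.
final case class Rolled(
    metricName: String,
    sum: Option[Long],
    max: Option[Long],
    avg: Option[Long])

object RollupSketch {
  def rollup(rows: Seq[StageRow]): Seq[Rolled] =
    rows.groupBy(_.metricName).toSeq.sortBy(_._1).map { case (name, rs) =>
      val sums = rs.flatMap(_.sum)
      val maxes = rs.flatMap(_.max)
      // Pair each stage's (avg * numTasks) with its numTasks.
      val weighted =
        rs.flatMap(r => r.avg.map(a => (a * r.numTasks, r.numTasks.toLong)))
      val totalTasks = weighted.map(_._2).sum
      Rolled(
        name,
        if (sums.isEmpty) None else Some(sums.sum),
        if (maxes.isEmpty) None else Some(maxes.max),
        // Integer division; the rounding behavior is an assumption.
        if (totalTasks == 0L) None else Some(weighted.map(_._1).sum / totalTasks))
    }
}
```

For example, two stages with 10 tasks averaging 10 and 30 tasks averaging 30 roll up to an average of (10·10 + 30·30) / 40 = 25 rather than the unweighted 20.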

The two previously flagged concerns remain open: the index parameter is still unused in aggregateGpuMetricsBySql / aggregateGpuMetricsByApp, and QualRawReportGenerator still emits all three GPU labels unconditionally (unlike the nonEmpty-guarded writes in Profiler.scala).

Confidence Score: 5/5

Safe to merge; all remaining findings are P2 style/consistency issues that don't affect correctness.

All logic reviewed: rolling-average usage of stats.med is confirmed correct (AccumInfo uses running-mean semantics, with a TODO to rename it); weighted-average formula in rollupGpuRows is arithmetically sound; numTasks=0 fallback now emits a warning; CSV guards in Profiler.scala are correct. Only pre-flagged P2s remain open.

QualRawReportGenerator.scala — unconditional GPU label emission differs from Profiler.scala's nonEmpty guard.
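The difference between the two write paths can be pictured with a small sketch. The helper name `labelsToWrite` and the use of plain strings for rows are assumptions for illustration; this is not the code in Profiler.scala or QualRawReportGenerator.scala.

```scala
object GuardedWrites {
  // Keep only the labels whose row collections are non-empty,
  // mirroring a nonEmpty guard placed in front of each write.
  def labelsToWrite(tables: Seq[(String, Seq[String])]): Seq[String] =
    tables.collect { case (label, rows) if rows.nonEmpty => label }
}
```

With a guard like this, a CPU-only application produces no GPU entries at all, whereas unconditional emission produces the labels with empty contents.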

Important Files Changed

| Filename | Overview |
|---|---|
| core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/AppSparkMetricsAnalyzer.scala | Core of the PR: adds GPU metric aggregation at stage/SQL/app levels. `stats.med` holds the rolling average (AccumInfo carries a TODO to rename it), the `rollupGpuRows` weighted-avg formula is correct, and the `numTasks=0` fallback now emits a warning. |
| core/src/main/scala/org/apache/spark/sql/rapids/tool/store/AccumMetaRef.scala | Extends METRICS_WITH_MAX_AGGREGATES from 4 to 9 entries; new entries are consistent with max-aggregate semantics and match RAPIDS plugin GpuTaskMetrics. |
| core/src/main/scala/com/nvidia/spark/rapids/tool/profiling/ProfileClassWarehouse.scala | Adds three new case classes (StageAggGpuMetricsProfileResult, SQLAggGpuMetricsProfileResult, AppAggGpuMetricsProfileResult) with correct Option[Long] fields and CSV serialisation. |
| core/src/main/scala/com/nvidia/spark/rapids/tool/views/QualRawReportGenerator.scala | GPU labels added unconditionally to the output map, unlike the nonEmpty-guarded writes in Profiler.scala; CPU-only apps will emit empty GPU entries in the qual path. |
| core/src/test/scala/com/nvidia/spark/rapids/tool/profiling/AnalysisSuite.scala | Three new tests cover GPU log, max-aggregated metric semantics, and CPU-only empty output; rollup math assertions are thorough. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[app.accumManager] -->|filter isGpuMetric| B[GPU Accumulators]
    B -->|calculateAccStatsForStage| C[aggregateGpuMetricsByStage]
    C -->|StageAggGpuMetricsProfileResult| D[gpuStageRows]
    D -->|groupBy stageId| E[stageMap]
    F[app.sqlIdToStages] --> G[aggregateGpuMetricsBySql]
    E --> G
    G -->|rollupGpuRows per SQL| H[SQLAggGpuMetricsProfileResult]
    D -->|rollupGpuRows all stages| I[aggregateGpuMetricsByApp]
    I --> J[AppAggGpuMetricsProfileResult]
    D --> K[AggRawMetricsResult.gpuStageAggs]
    H --> L[AggRawMetricsResult.gpuSqlAggs]
    J --> M[AggRawMetricsResult.gpuAppAggs]
    K -->|nonEmpty guard| N[gpu_stage_level_aggregated_task_metrics.csv]
    L -->|nonEmpty guard| O[gpu_sql_level_aggregated_task_metrics.csv]
    M -->|nonEmpty guard| P[gpu_app_level_aggregated_task_metrics.csv]

Reviews (3): Last reviewed commit: "Address greptile review on PR #2088"

Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
# Conflicts:
# core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/AggRawMetricsResult.scala
# core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/AppSparkMetricsAggTrait.scala
# core/src/main/scala/com/nvidia/spark/rapids/tool/profiling/ApplicationSummaryInfo.scala
# core/src/main/scala/com/nvidia/spark/rapids/tool/profiling/Profiler.scala
# core/src/main/scala/com/nvidia/spark/rapids/tool/views/QualRawReportGenerator.scala
# core/src/main/scala/com/nvidia/spark/rapids/tool/views/RawMetricProfView.scala

- Drop dead StageAggGpuMetricsProfileResult.aggregateStageProfileMetric. Stage attempts already merge upstream at the AccumInfo layer (stagesStatMap is keyed by stageId only, not stageId+attemptNumber), so a separate merge step on the case class is never invoked. Replaced the method with a comment explaining the upstream merging.
- Document the numTasks=0 invariant in aggregateGpuMetricsByStage and log a warning if the stage-task metrics cache lookup misses (which would silently distort the task-weighted avg at SQL/app level).
Fixes NVIDIA#2020
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
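A minimal sketch of the warn-on-miss fallback described in the second bullet, assuming a hypothetical `Map` from stageId to task count and a caller-supplied `warn` callback. The real code reads the stage-task metrics cache and uses the project's own logger, neither of which is modeled here.

```scala
object NumTasksLookup {
  // Hypothetical lookup: fall back to 0 tasks and warn when the stage is missing.
  // getOrElse takes its default by name, so warn runs only on a cache miss.
  def numTasksFor(stageId: Int, cache: Map[Int, Int], warn: String => Unit): Int =
    cache.getOrElse(stageId, {
      warn(s"No cached task metrics for stage $stageId; using numTasks=0")
      0
    })
}
```

A stage that falls back to 0 tasks contributes nothing to a task-weighted average, which is why a silent miss would distort the SQL and app rollups.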

hirakendu

LGTM, super useful!

github-actions Bot added the core_tools label ("Scope the core module (scala)") Apr 27, 2026

parthosa self-assigned this Apr 27, 2026

parthosa and others added 2 commits April 27, 2026 09:48

parthosa requested review from amahussein, hirakendu and sayedbilalbari April 27, 2026 20:13

parthosa marked this pull request as ready for review April 27, 2026 20:13

greptile-apps Bot reviewed Apr 27, 2026

Comment thread core/src/main/scala/com/nvidia/spark/rapids/tool/profiling/ProfileClassWarehouse.scala Outdated

Comment thread core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/AppSparkMetricsAnalyzer.scala

parthosa and others added 2 commits April 30, 2026 16:14

hirakendu approved these changes May 1, 2026

parthosa merged commit 4d88577 into NVIDIA:dev May 4, 2026
17 checks passed

parthosa merged 5 commits into NVIDIA:dev from parthosa:rapids-tools-2020
